Goto

Collaborating Authors

 exponential dependence






Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

arXiv.org Machine Learning

In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $ฯ€^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $ฮฒ\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI) which leads to $(ฮต,ฮด)$-PAC-bounds on $\|Q^*-Q^k\|$, and $\|V^*-V^{ฯ€_k}\|$ where $Q_k$ is the output of MB-RS-QVI after k iterations and $ฯ€_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-ฮณ}$ and the strength of this dependence grows with the learners risk-sensitivity $|ฮฒ|$. We also provide two lower bounds which shows that exponential dependence on $|ฮฒ|\frac{1}{1-ฮณ}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are both tight in $\varepsilon$ and $ฮด$ and that the PAC-bound on $Q$-learning is tight in the number of actions $A$, and that the PAC-bound on policy-learning is nearly tight in $A$.


Reviews: Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Neural Information Processing Systems

Generalization bounds on neural nets, based on Rademacher complexity, use the norm bounds on weights of layers, which gives an exponential dependence on depth. Moreover, existing lower bounds show that this is unavoidable (in general). The goal of the paper is to get bounds polynomial in depth by additionally using properties of training data. However, such data dependent bounds comes with challenges, discussed in the paper. The authors introduce "augmenting" the loss function with desirable properties and present tools to derive covering bounds on augmented loss. Comments: 1. Data-dependent generalization bounds have recently become popular to derive sharper generalization bounds. This paper contributes to this line of work by considering properties of training data, in particular norms of layers and norms of Jacobians of laters with other layers. They paper presents the (novel) idea of augmenting the loss function with the desirable properties, they then derive generalization bounds on the augmented loss.


Review for NeurIPS paper: Sinkhorn Barycenter via Functional Gradient Descent

Neural Information Processing Systems

Weaknesses: The constants in the bounds depend linearly on the dimension, although they depends exponentially on the regularization parameter. If Sinkhorn distance is thought as a proxy of the Wasserstein distance, this seems to be a hidden dependance on the dimension, since the regularization parameter plays the role of an interpolation between MMD and Wasserstein distances, and MMD distances are more blind to the dimension. This is not discussed in the paper. The results also have an exponential dependence on an assumed uniform upper bound on the cost. For the classical quadratic cost, this imply an exponential dependence on the dimension for the case of measures supported on [0,1] d for instance.


A Regularized Online Newton Method for Stochastic Convex Bandits with Linear Vanishing Noise

arXiv.org Machine Learning

We study a stochastic convex bandit problem where the subgaussian noise parameter is assumed to decrease linearly as the learner selects actions closer and closer to the minimizer of the convex loss function. Accordingly, we propose a Regularized Online Newton Method (RONM) for solving the problem, based on the Online Newton Method (ONM) of arXiv:2406.06506. Our RONM reaches a polylogarithmic regret in the time horizon $n$ when the loss function grows quadratically in the constraint set, which recovers the results of arXiv:2402.12042 in linear bandits. Our analyses rely on the growth rate of the precision matrix $\Sigma_t^{-1}$ in ONM and we find that linear growth solves the question exactly. These analyses also help us obtain better convergence rates when the loss function grows faster. We also study and analyze two new bandit models: stochastic convex bandits with noise scaled to a subgaussian parameter function and convex bandits with stochastic multiplicative noise.


Reviews: Empirical Risk Minimization in Non-interactive Local Differential Privacy Revisited

Neural Information Processing Systems

In this setting, each user (holding one data point) is required to send a differentially private signal to the server without any prior interaction with the server or other users. Then, the server collects the users' signals and uses them to solve the ERM problem. The most relevant previous work is [19] that shows that any protocol that is based on first (or second) order methods (e.g., gradient descent and other variants) must require sample size \Omega(\alpha {-p}) if it were to achieve error \alpha (where p is the dimensionality of the parameter space). This reference also gives upper bounds of the same order for non-interactive ERM under Local Differential Privacy (LDP) for the class of Lipschitz loss functions and the class of Lipschitz, convex loss functions. This paper revisits this problem under some smoothness assumptions on the loss function, and devises new algorithms for this problem based on polynomial approximation techniques.


Meta Reinforcement Learning with Finite Training Tasks -- a Density Estimation Approach

arXiv.org Artificial Intelligence

In meta reinforcement learning (meta RL), an agent learns from a set of training tasks how to quickly solve a new task, drawn from the same task distribution. The optimal meta RL policy, a.k.a. the Bayes-optimal behavior, is well defined, and guarantees optimal reward in expectation, taken with respect to the task distribution. The question we explore in this work is how many training tasks are required to guarantee approximately optimal behavior with high probability. Recent work provided the first such PAC analysis for a model-free setting, where a history-dependent policy was learned from the training tasks. In this work, we propose a different approach: directly learn the task distribution, using density estimation techniques, and then train a policy on the learned task distribution. We show that our approach leads to bounds that depend on the dimension of the task distribution. In particular, in settings where the task distribution lies in a low-dimensional manifold, we extend our analysis to use dimensionality reduction techniques and account for such structure, obtaining significantly better bounds than previous work, which strictly depend on the number of states and actions. The key of our approach is the regularization implied by the kernel density estimation method. We further demonstrate that this regularization is useful in practice, when `plugged in' the state-of-the-art VariBAD meta RL algorithm.